Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling

Authors

  • Julián Luengo
  • Alberto Fernández
  • Salvador García
  • Francisco Herrera
Abstract

In the classification framework there are problems in which the number of examples per class is not equitably distributed, commonly known as imbalanced data sets. This situation is a handicap when trying to identify the minority classes, as learning algorithms are not usually adapted to such characteristics. A common approach to dealing with imbalanced data sets is to apply a preprocessing step. In this paper we analyze the usefulness of data complexity measures for evaluating the behavior of undersampling and oversampling methods. Two classical learning methods, C4.5 and PART, are considered over a wide range of imbalanced data sets built from real data. Specifically, oversampling techniques and one evolutionary undersampling technique have been selected for the study. We extract behavior patterns from the results in the data complexity space defined by the measures, coding them as intervals. Then, we derive rules from the intervals that describe both good and bad behaviors of C4.5 and PART for the different preprocessing approaches, thus obtaining a complete characterization of the data sets and of the differences between the oversampling and undersampling results.
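As a rough illustration of the kind of SMOTE-based oversampling named in the title, the sketch below creates synthetic minority examples by interpolating between a minority point and one of its nearest minority neighbors. It is a minimal sketch only: the function name `smote_sketch`, the toy data, and the parameters `k` and `seed` are assumptions made for illustration, and the abstract does not state which SMOTE variants or settings the authors used.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority neighbors.

    Hypothetical helper for illustration; it is not the authors' implementation.
    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        # Pick one of the k nearest neighbors at random.
        neighbor = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


# Toy data (assumed): 4 minority examples against a majority class of size 10.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority_size = 10
new_points = smote_sketch(minority, n_new=majority_size - len(minority))
balanced_minority = minority + new_points
```

Because every synthetic point lies on a segment between two existing minority examples, this style of oversampling fills in the minority region rather than duplicating examples, which is one reason its effect can depend on how the classes overlap in the data.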

Similar resources

Improved Sampling Techniques for Learning an Imbalanced Data Set

This paper presents the performance of a classifier built using the stackingC algorithm in nine different data sets. Each data set is generated using a sampling technique applied on the original imbalanced data set. Five new sampling techniques are proposed in this paper (i.e., SMOTERandRep, Lax Random Oversampling, Lax Random Undersampling, Combined-Lax Random Oversampling Undersampling, and C...

Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning

In recent years, mining with imbalanced data sets has received more and more attention in both theoretical and practical aspects. This paper introduces the importance of imbalanced data sets and their broad application domains in data mining, and then summarizes the evaluation metrics and the existing methods to evaluate and solve the imbalance problem. Synthetic minority oversampling technique (S...

Data Preprocessing for Liver Dataset Using SMOTE

The class imbalance problem occurs in various disciplines when one of the target classes has a small number of instances compared to the other classes. A classifier normally ignores or fails to detect a minority class due to the small number of class instances. This poses a challenge to any classifier, as it becomes hard to learn the minority class samples. Most of the oversampling methods may generat...

Comparison of Data Sampling Approaches for Imbalanced Bioinformatics Data

Class imbalance is a frequent problem found in bioinformatics datasets. Unfortunately, the minority class is usually also the class of interest. One of the methods to improve this situation is data sampling. There are a number of different data sampling methods, each with their own strengths and weaknesses, which makes choosing one a difficult prospect. In our work we compare three data samplin...

Guest editorial: special issue on "Intelligent Systems, Design and Applications (ISDA'2009)"

This special issue includes eight papers focused on recent developments in the field of intelligent systems. The issue originated from selected high-quality contributions to the ninth international conference on Intelligent Systems Design and Applications, held in Pisa, Italy, 30 November to 2 December 2009. The selected contributions were expanded and subsequently peer-reviewed. Ten were th...

Journal:
  • Soft Comput.

Volume 15

Publication date: 2011